The country of Oceanus has sought FishEye International’s help in identifying companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing. Using import/export trade data, FishEye’s analysts hope to understand business relationships, including finding links that will help them stop fishy IUU activities and protect marine species that are affected by it.
FishEye knows from past experience that companies caught fishing illegally will often shut down then start up again under a different name. Visualising these temporal patterns could thus help in comparing the activities of companies over time to determine if the companies have returned to their nefarious acts.
Project Objectives
This study thus aims to visualise temporal patterns for individual entities and between entities from trade records, and categorize the types of business relationship patterns found through analysis, paying particular attention to abnormalities in to detect fishy trade activities.
Notes:
- volumeteu:TEU (Twenty-foot Equivalent Unit) is a unit of measurement of shipping volume, used to determine cargo capacity for container ships and terminals.
- hscode:(Harmonized System Codes) is a standardized numerical method of identifying traded products.
1.1: Splitting data into nodes and edges
mc2_nodes <-as_tibble(mc2$nodes) %>%# reorder dataframe columnsselect(id, rcvcountry, shpcountry) %>%arrange(id)mc2_links <-as_tibble(mc2$links) %>%# Extract year from date and save as factor columnmutate(year =as_factor(year(arrivaldate))) %>%# Move Source and Target to the frontselect(source, target, hscode, year, arrivaldate, weightkg, volumeteu, valueofgoods_omu, valueofgoodsusd)
1.2: Data Health
Diagnostic checks to survey the extent of “incompleteness” of the data
# Check for columns with missing valuescolSums(is.na(mc2_nodes))
id rcvcountry shpcountry
0 2909 22359
rcvcountry has 2909 missing values, and shpcountry has 22359 missing values. There are more incomplete records pertaining to origin of the trades. To facilitate analysis, all NA values are replaced by “Unknown”:
valueofgoods_omu has too many missing values and will unnecessarily skew aggregated values. While some missing values could be intentional non-reporting of value and thus indicate possible IUU, there are too many to be able to filter out based on this criteria alone. This column will be dropped for the analysis, while the relationship between weightkg, volumeteu and valueofgoodsusd will be looked into more closely to check for possible discrepencies.
II. Checking for Duplicates:
# Check for duplicated rowsmc2_links[duplicated(mc2_links),]
# A tibble: 155,291 × 9
source target hscode year arrivaldate weightkg volumeteu valueofgoods_omu
<chr> <chr> <chr> <fct> <chr> <int> <dbl> <dbl>
1 French C… Mar d… 851629 2028 2028-03-09 20 0 NA
2 French C… Mar d… 851629 2028 2028-06-02 110 0 NA
3 French C… -1992 852990 2028 2028-01-02 2830 0 NA
4 French C… -1992 852990 2028 2028-01-02 2830 0 NA
5 French C… -1992 852990 2028 2028-01-04 2805 0 NA
6 French C… -1992 852990 2028 2028-01-11 1895 0 NA
7 French C… -1992 852990 2028 2028-01-11 2790 0 NA
8 French C… -1992 852990 2028 2028-01-11 2790 0 NA
9 French C… -1992 852990 2028 2028-01-18 1895 0 NA
10 French C… -1992 852990 2028 2028-01-18 2790 0 NA
# ℹ 155,281 more rows
# ℹ 1 more variable: valueofgoodsusd <dbl>
There are 155,291 duplicated transactions. As these may be indicative of possible fishy activity, the duplicated rows are saved separately as mc2_links_dup.
# Save duplicated transactions in a separate tibblemc2_links_dup <- mc2_links[duplicated(mc2_links),]# Leave only unique transactions in original edges tibblemc2_links <-unique(mc2_links)
1.3: Filtering and Aggregating links Variables
hscode is an internationally used 6-digit system to classify traded goods, documented by International and Governmental bodies such as the World Customs Organisation and United States ITC These different articles/products are coded as subcategories under 99 chapters, with the first 2 digits identifying the chapter that the goods are classified in.
From the code output describing the distribution of hscode numbers above (1 - 99), it is not possible to assume that the first number of each 6-digit hscode represents the chapter number identifying type of traded good, as most official hscodes start with a ‘0’. Products of interest belong to Chapter 3: Fish and Crustaceans, Molluscs and other Aquatic Invertebrates, but it is unclear if hscodes beginning with ‘3’ belong to chapter 3 or chapter 30.
The transactions recorded in mc2_links are too granular for visualisation, spread across 7 years (2028-2034). The dataframe is grouped by source, target and year-month, to get the number of transactions per traderoute (source->target) per month each year.
code block
mc2_links_agg <- mc2_links %>%# Change date to ym formatmutate(yearmonth =floor_date(as_date(arrivaldate), "month")) %>%# group by traderoute and yearmonthgroup_by( source, target, year, yearmonth) %>%summarise(# count number of transactions per source-target traderoute per product type per monthweight =n(),# Calculate total weight of export per month per traderouteweightkg =sum(weightkg),# Calculate total volume in teu per month per traderoutevolumeteu =sum(volumeteu),# Calculate total value in usd per month per traderoutevalueusd =sum(valueofgoodsusd) ) %>%filter(# Filter to keep traderoutes with more than one transaction weight >1,# Filter to keep traderoutes between different companies only source != target) %>%arrange( year, yearmonth) %>%ungroup()
Summary StatisticsI. Trends in number of unique traderoutes (sourcetarget) by Year
code block
# Aggregating number of routes into new dataframeroutes_by_year <- mc2_links_agg %>%# Change month to charactermutate(yearmonth =format(yearmonth,"%b")) %>%# Group by year and month to count frequency of trade routesgroup_by(year, yearmonth) %>%summarise(nroutes =n(),avgkg =sum(weightkg)/nroutes) %>%# Sort by Year-montharrange(year, # Manually assign ordinal levels to month to retain orderfactor(yearmonth, levels =c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))) %>%ungroup()# Create pivot table to get years as columns rby_pivot <- routes_by_year %>%# Remove avgkg column select(-avgkg) %>%pivot_wider(names_from = year,values_from = nroutes)# Manually set order of months in new pivot tablerby_pivot$yearmonth <-factor(rby_pivot$yearmonth, levels =c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))# List columns (years) to use in plot rby_columns <-colnames(rby_pivot)# Setting default theme for all plotsset_urbn_defaults("print", base_family ="Lato")# Specify annotationnote1 <-list(xref ='paper',yref ='y',x =0.23,y =8800,xanchor ='left',yanchor ='middle',text ="Abnormal Increase in Mar 2030",font =list(size =12,# color-code annotation to match line colorcolor ='rgb(85,183,72)'),showarrow =FALSE)# Plot trend linesrby <-plot_ly( rby_pivot, x =~yearmonth ) %>%# Add Trendlines by Yearadd_trace(y =~rby_pivot[[rby_columns[2]]],type ='scatter', mode ='lines+markers', name ="2028" ) %>%add_trace(y =~rby_pivot[[rby_columns[3]]],type ='scatter', mode ='lines+markers', name ="2029" ) %>%add_trace(y =~rby_pivot[[rby_columns[4]]],type ='scatter', mode ='lines+markers', name ="2030" ) %>%add_trace(y =~rby_pivot[[rby_columns[5]]],type ='scatter', mode ='lines+markers', name ="2031" ) %>%add_trace(y =~rby_pivot[[rby_columns[6]]],type ='scatter', mode ='lines+markers', name ="2032" ) %>%add_trace(y =~rby_pivot[[rby_columns[7]]],type ='scatter', mode ='lines+markers', name ="2033" ) %>%add_trace(y =~rby_pivot[[rby_columns[8]]],type ='scatter', mode ='lines+markers', name ="2034" ) %>%layout(title ="Increasing Trend in Traderoute Count in 2029 and 2033", yaxis =list(title ='Number of Traderoutes'),# Specify blank title to remove x-axis titlexaxis =list(title =""),# Add top margin to prevent title from being truncatedmargin =list(t =80),# Add annotation to plotannotations = note1,# Specify plot and background colorsplot_bgcolor ="#F8F3E6",paper_bgcolor ="#F8F3E6" ) rby
II. Trends in Average Weight (kg) of Exports by Year
code block
# Create pivot table to get years as columns kgby_pivot <- routes_by_year %>%# Remove nroutes column select(-nroutes) %>%pivot_wider(names_from = year,values_from = avgkg)# Manually set order of months in new pivot tablekgby_pivot$yearmonth <-factor(kgby_pivot$yearmonth, levels =c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))# List columns (years) to use in plot kgby_columns <-colnames(kgby_pivot)# Plot trend lineskgby <-plot_ly( kgby_pivot, x =~yearmonth ) %>%# Add Trendlines by Yearadd_trace(y =~kgby_pivot[[kgby_columns[2]]],type ='scatter', mode ='lines', name ="2028" ) %>%add_trace(y =~kgby_pivot[[kgby_columns[3]]],type ='scatter', mode ='lines', name ="2029" ) %>%add_trace(y =~kgby_pivot[[kgby_columns[4]]],type ='scatter', mode ='lines', name ="2030" ) %>%add_trace(y =~kgby_pivot[[kgby_columns[5]]],type ='scatter', mode ='lines', name ="2031" ) %>%add_trace(y =~kgby_pivot[[kgby_columns[6]]],type ='scatter', mode ='lines', name ="2032" ) %>%add_trace(y =~kgby_pivot[[kgby_columns[7]]],type ='scatter', mode ='lines', name ="2033" ) %>%add_trace(y =~kgby_pivot[[kgby_columns[8]]],type ='scatter', mode ='lines', name ="2034" ) %>%layout(title ="Fluctuations in Average Trade Weights Across each Year", yaxis =list(title ='Average Weight of Trade (kg)'),# Specify blank title to remove x-axis titlexaxis =list(title =""),# Add top margin to prevent title from being truncatedmargin =list(t =80),# Specify plot and background colorsplot_bgcolor ="#F8F3E6",paper_bgcolor ="#F8F3E6" )kgby
Insights from LineChart:
Abnormal increase in traderoute count in Mar 2030
No regularly coinciding ‘peaks’ to suggest seasonal trends (through changes in export weight) across the years
To determine if there are specific companies or traderoutes accounting for a high percentage of these spikes in trade volume, it would be useful to visualise and compare tradecount and export weight of the highest contributors over the years. This is to investigate if the spikes are attributed to specific companies or traderoutes, or just an overall increase in numbers or trade volume across the board (pointing to increases in demand instead of fishy activity).
2: ExploringFishy Data Patterns
From the Summary Statistics and preliminary visualisations in Section 1.4, traderoutes will be investigated based on abnormalities in temporal trends across number of trades and export weight. The analysis will focus on visualising networks based on filtered records:
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#93c7c2',
'primaryTextColor': '#fff',
'primaryBorderColor': '#3d7670',
'lineColor': '#3d7670',
'secondaryColor': '#3d7670',
'tertiaryColor': '#fff'
}
}
}%%
flowchart LR
A{fa:fa-fish-fins \nOverall\nnetwork} ==> B((Country\nLevel))
A ==> C((Company\nLevel))
C -.- D{fa:fa-calendar-days \nTime} -->|patterns?| E(trade frequency)
B -.- D -->|patterns?| F(trade weight)
D -->|patterns?| G(trade value\n$usd)
2.1 Country Level Trade Networks
In order to get plot a network graph detailing links between countries, new links and nodes dataframes have to be created from existing data.
country_links <- mc2_links_agg %>%# Match source company to nodes to get shipping origin countryinner_join(mc2_nodes, by =join_by("source"=="id")) %>%# Use shpcountry as source, rcvcountry as target for linksselect(shpcountry, rcvcountry, year, yearmonth, weight, weightkg) %>%filter(shpcountry != rcvcountry)
2.1.1: Investigating 2030 data
From the summary statistics outlined in Section 1, there was an observed abnormal increase in tradecount in March 2030. Trade records from 2030 are thus filtered out for investigation:
# A tbl_graph: 137 nodes and 4439 edges
#
# A directed multigraph with 1 component
#
# A tibble: 137 × 1
country
<chr>
1 -22004
2 -22005
3 -22007
4 Alverossia
5 Alverovia
6 Andenovia
# ℹ 131 more rows
#
# A tibble: 4,439 × 5
from to yearmonth weight weightkg
<int> <int> <date> <int> <dbl>
1 1 75 2030-01-01 6 580805
2 1 75 2030-02-01 9 1732215
3 1 75 2030-03-01 2 252950
# ℹ 4,436 more rows
Step 4: Visualise country trade network for 2030:
code block
# Measure directed out-degree centrality and save as a columnV(country_graph_2030)$out_degree <-degree(country_graph_2030, mode ="out")set.seed(1234)g2030 <- country_graph_2030 %>%activate(edges) %>%mutate(yearmonth =factor(format(yearmonth, "%b"),levels =c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))) %>%ggraph(layout ="linear",circular =TRUE) +geom_edge_fan(aes(width = weight,color = yearmonth),alpha = .6,arrow =arrow(length =unit(2, 'mm')),show.legend =FALSE ) +scale_edge_width(range =c(0.1,4) ) +geom_node_point(aes(size = out_degree),color ="grey20",alpha = .7,show.legend =FALSE ) +geom_node_text(aes(label =ifelse(out_degree >quantile(out_degree, .75), country, "")), size =3,repel =TRUE ) +theme(plot.title =element_text(size =20,color ="grey20"),legend.title =element_text(),plot.background =element_rect(fill="#F8F3E6",colour="#F8F3E6") ) +transition_states(yearmonth,transition_length =6,state_length =4 ) +labs(title ="Country-level Trade Network in {closest_state} 2030",subtitle ="Larger nodes have higher Out-Degree Centrality. Countries with the most export routes are identified: " ) +enter_fade() +exit_fade()g2030
Insights from Network Graph:
There are no visually fishy patterns identified in March 2030, compared to the other months. The links with thicker widths represent higher number of traderoutes – these have largely remained the same throughout the year. This suggests that the abnormal increase could be detected at a company-level instead of a country-level network analysis. However, more information is first gathered from country-level traderoute activities to sieve out possible flags for IUU activity.
2.1.2: Visualising Records Across the Years
Which countries have the heaviest trade activities? Plotting an overall network graph would result in a dense web of links with little analytical value. To detect fishy temporal patterns, countries that have higher traderoute counts are used for the network analysis.
2.1.2: Comparing Distribution of Centrality Measures
Data Preparation:
code block
# Retrieve Centrality measures from nodescountry_centrality <-data.frame(OutDegree =V(country_graph)$out_degree,InDegree =V(country_graph)$in_degree,Betweenness_centrality =V(country_graph)$betweenness,Eigenvector_centrality =V(country_graph)$eigencentrality,PageRank_score =V(country_graph)$pagerank)# Create function to transform variables to same scale (min-max normalisation)transform_variable <-function(x) { (x -min(x)) / (max(x) -min(x))}# function to apply scaling across all variables in a dataframetransform_dataframe <-function(df) { df %>%mutate(across(where(is.numeric) &!matches("uuid"), transform_variable))}# Apply function and save to new dataframecountry_centrality_scaled <-transform_dataframe(country_centrality)# Pivot longer to get centrality measures as factorscountry_centrality_long <-gather(country_centrality_scaled, key ="CentralityMeasure", value ="CentralityScore")
code block
# Density ridges to show distribution of dataggplot( country_centrality_long, aes(x = CentralityScore, y = CentralityMeasure, fill = CentralityMeasure,color = CentralityMeasure) ) +geom_density_ridges(alpha = .6,scale =3 ) +geom_rug() +labs(title ="Similar Distribution of Values Across Centrality Measures",subtitle ="Right-skewed distribution with presence of outliers",x ="Centrality Score",y ="Centrality Measure" ) +theme(legend.position ="none",axis.title =element_blank(),panel.grid.major =element_blank(),plot.background =element_rect(fill="#F8F3E6",colour="#F8F3E6") )
Picking joint bandwidth of 0.0303
Insights from Network Graphs:
2.2 Company Level Trade Networks: Top Traderoutes by Frequency
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#93c7c2',
'primaryTextColor': '#fff',
'primaryBorderColor': '#3d7670',
'lineColor': '#3d7670',
'secondaryColor': '#3d7670',
'tertiaryColor': '#fff'
}
}
}%%
flowchart LR
A{fa:fa-fish-fins \nOverall\nnetwork} ==> B((Country\nLevel))
A ==> C((Company\nLevel)) -.-|filter| B
C -.- D{fa:fa-calendar-days \nTime} -->|patterns?| E(trade frequency)
B -.- D -->|patterns?| F(trade weight)
D -->|patterns?| G(trade value\n$usd)
# Extract centrality metrics from country graph and save into new data framecountry_filter <-data.frame(country =V(country_graph)$country,OutDegree =V(country_graph)$out_degree,InDegree =V(country_graph)$in_degree,Betweenness_centrality =V(country_graph)$betweenness,Eigenvector_centrality =V(country_graph)$eigencentrality,PageRank_score =V(country_graph)$pagerank)# Define function to filter each variable by percentilepercentile_filter <-function(x) { x >=quantile(x, .75)}# Filter the dataframe to retrieve list of countries country_filter <- country_filter %>%filter(percentile_filter(OutDegree) |percentile_filter(InDegree) |percentile_filter(Betweenness_centrality) |percentile_filter(Eigenvector_centrality) |percentile_filter(PageRank_score) )
Step 3: Aggregate the frequency of traderoutes in filtered coutries
routes_by_count <- mc2_links_agg %>%# Ensure that companies in the dataframe are from top trading countries filter(source %in% mc2_nodes_filtered$id | target %in% mc2_nodes_filtered$id) %>%group_by(source, target) %>%filter(source != target) %>%# Get count of trade routesummarise(count =sum(weight)) %>%# Arrange in descending order to get top routes firstarrange(desc(count)) %>%ungroup()datatable(routes_by_count)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
Insights from Table:
There seems to be many common source and target companies appearing across various traderoutes. This suggests that IUU fishing activity could be detected through visualising this information as a network and using centrality measures to determine key players. Since companies caught fishing illegally shut down but often start up again under a different company, identifying fishy companies by comparing their trading partners over the years could help determine if they are engaging in illegal acts.
The density raincloud plots of traderoute counts reveal the presence of outliers with values far above the median and 95th percentile value. This suggets that abnormal fishing activity could be investigated by filtering out traderoutes with very high frequency. Investigating by using countries as filters, as well as companies beyond the top 95th percentile of traderoutes could thus give some insight into IUU fishing networks.
2.3: Who are the Top players in the Trade Network by Tradecount?
links_sorted <- mc2_links_agg %>%# Only keep companies in suspicious countriesfilter(source %in% mc2_nodes_filtered$id | target %in% mc2_nodes_filtered$id) %>%# Group and calculate total number of tradecounts per yeargroup_by(source, target, year) %>%summarise(weight =sum(weight)) %>%# Arrange data in order of year and descending order of tradecounts to get toparrange(year, desc(weight)) %>%ungroup()# Filter out top 30 per yeartop_30 <- links_sorted %>%group_by(year) %>%# Get top 30 companies by weighttop_n(30, weight) %>%ungroup()
# Create nodes dataframe for top 10 exporters from oceanusttdistinct_source <- top_30 %>%distinct(source)ttdistinct_target <- top_30 %>%distinct(target)# Select only overlapping nodes and distinct companies from linksttnodes_source <-inner_join( ttdistinct_source, mc2_nodes,by =c("source"="id")) %>%mutate(id = source)ttnodes_target <-inner_join( ttdistinct_target, mc2_nodes,by =c("target"="id")) %>%mutate(id = target)# Create new nodes data by combining filtered dataframestop_30_nodes <-bind_rows(ttnodes_source, ttnodes_target) %>%left_join(source_count, by =c("id"="source")) %>%select(id, uuid, shpcountry, rcvcountry, sourcecount)
In the plot above, the relative sizes of the nodes are based on Out-degree centrality, where larger nodes have a higher number of out-going links. This enables us to visually determine companies with higher number of exporting links over time. Subsequent plots using In-degree centrality and betweenness centrality instead are shown below:
# Apply function to normalize values in dataframetop30_centrality_scaled <- top30_centrality %>%select(-uuid)top30_centrality_scaled <-transform_dataframe(top30_centrality_scaled)# Pivot longer to get centrality measures as factorstop30_centrality_long <-gather(top30_centrality_scaled, key ="CentralityMeasure", value ="CentralityScore")
code block
# Density ridges to show distribution of dataggplot( top30_centrality_long, aes(x = CentralityScore, y = CentralityMeasure, fill = CentralityMeasure,color = CentralityMeasure) ) +geom_density_ridges(alpha = .6,scale =2 ) +geom_rug() +annotate(geom ="text",x = .7,y =4.5,label ="Highly right-skewed distribution with smaller peaks" ) +labs(title ="Similar Distribution of Values Across Centrality Measures",x ="Centrality Score",y ="Centrality Measure" ) +theme(legend.position ="none",axis.title =element_blank(),panel.grid.major =element_blank(),plot.background =element_rect(fill="#F8F3E6",colour="#F8F3E6") )
Picking joint bandwidth of 0.0279
Insights from Network Graphs and Centrality Measure Distributions:
From the network graphs, we can see that key nodes are unchanged for the entire time period (2028-2034), but heavier traderoutes (identified by thicker links) seem to vary from key nodes through the years. This could be indicative of IUU activity, with fishy trading partners. The distribution graphs show intensely skewed distributions of centrality measures. This enables us filter out companies based on out-degree centrality > 10 and betweenness centrality > 10 to identify the top exporters by tradecount.
2.4: Who are the Top 30 exporters from Oceanus by export weight per Year?
Insights from Network Graphs and Centrality Measure Distributions:
Comparing the distribution of centrality measures, there are significantly more nodes with high degree but low betweenness centrality. This suggests that there are clusters of nodes in the network, and that these nodes are well-connected within their clusters but not to the rest of the nodes in other clusters. This could be indicative of various smaller IUU networks operating within the overall network.
2.5: What are the most heavily traded products?
Using hscode (product code), we can filter the most frequently traded products by number of transactions, total export weight, as well as monetary value.
links_top_year <- mc2_links_top %>%group_by( source, target, year) %>%summarise(# Number of active months in the yearmonths_active =n(),# Total number of transactions per yearweight =sum(weight),# Avg weight of trade per monthavg_weightkg =round((sum(weightkg)/months_active),0)) %>%ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
code block
datatable(links_top_year)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
code block
# filter distinct source and target from mc2_edgesdistinct_source <- links_top_year %>%distinct(source)distinct_target <- links_top_year %>%distinct(target)# Select only overlapping nodes and distinct companies from linksnodes_source <-inner_join( distinct_source, mc2_nodes,by =c("source"="id")) %>%mutate(id = source)nodes_target <-inner_join( distinct_target, mc2_nodes,by =c("target"="id")) %>%mutate(id = target)# Create new nodes data by combining filtered dataframesnodes_new <-bind_rows(nodes_source, nodes_target) %>%select(id, shpcountry, rcvcountry)
code block
# Combine edges and nodes to a graph dataframelinks_top_vis <- links_top_year %>%rename("from"="source", "to"="target") %>%select(from, to, year, months_active, weight, avg_weightkg)